Open GiLAN Testbed Results

The graphs summarise the NFV performance metrics.

General Workload Characterisation


The system runs different processes depending on the applications running. The operations of these applications and their respective processes are executed through system calls, and the OS supports a wide range of them. In general, the most frequent types of system call provide a general characterisation of the workload running on the OS, and this workload characterisation is a good starting point for understanding the system and the applications running on it. In addition to the frequent system calls, details on the processes making those calls help in understanding the system, and the latency of both the system calls and the calling processes is a starting point for understanding the latency of the system as a whole. From these results we can go further and look at the performance of the different compute resources; the characterisation tells us which compute results to focus on, e.g., if there are a lot of read syscalls, we can focus on the filesystem and cache.

CPU


The CPU is responsible for executing all workloads on the NFV. Like other resources, the CPU is managed by the kernel. User-level applications access CPU resources by issuing system calls to the kernel, which also receives requests from other processes; memory loads and stores, for instance, can trigger page faults. The primary consumers of CPU resources are threads (also called tasks), which belong to processes, kernel routines and interrupt routines. The kernel manages the sharing via a CPU scheduler.

There are three thread states: ON-PROC for threads running on a CPU, RUNNABLE for threads that could run but are waiting their turn, and SLEEP for threads blocked on another event, including uninterruptible waits. These can be grouped into two for more accessible analysis: on-CPU, referring to ON-PROC, and off-CPU, referring to all other states, where the thread is not running on a CPU. Threads leave the CPU in one of two ways: (1) voluntarily, if they block on I/O, a lock, or a sleep, or (2) involuntarily, if they have exceeded their scheduled allocation of CPU time. When a CPU switches from running one process or thread to another, it switches address spaces and other metadata. This is called a context switch, and it also consumes CPU resources. All of the activities described consume CPU time. In addition to time, another CPU resource used by processes, kernel routines and interrupt routines is the CPU cache.

There are typically multiple levels of CPU cache, increasing in both size and latency, ending with the last-level cache (LLC), which is large (Mbytes) and slower. On a processor with three levels of cache, the LLC is also the Level 3 cache. Processes are instructions to be interpreted and run by the CPU. These instructions and their data are typically loaded from RAM and cached in the CPU caches for faster access. The CPU first checks the lowest-level cache, i.e., the L1 cache. If the CPU finds the data there, this is called a cache hit. If not, it looks in L2 and then L3. If the CPU does not find the data in any cache level, it accesses it from main system memory (RAM); when that happens, it is known as a cache miss. In general, a cache miss means high latency, i.e., more time needed to access the data from memory.

Memory


The kernel and processor are responsible for mapping virtual memory to physical memory. For efficiency, memory mappings are created in groups of memory called pages. When an application starts, it begins with a request for memory allocation. If the heap cannot satisfy the request, the brk() syscall is issued to extend the size of the heap; larger allocations may instead be served by creating a new memory segment via the mmap() syscall. Initially, this virtual memory mapping does not have a corresponding physical memory allocation. Therefore, when the application tries to access the allocated memory segment, a fault called a page fault occurs on the MMU. The kernel then handles the page fault, mapping the virtual memory to physical memory. The amount of physical memory allocated to a process is called the resident set size (RSS). When there is too much memory demand on the system, the kernel page-out daemon (kswapd) may look for memory pages to free. Three types of pages can be freed, in this order: pages that were read but not modified (backed by disk), which can be freed immediately; pages that have been modified (dirty), which must be written to disk before they can be freed; and pages of application memory (anonymous), which must be stored on a swap device before they can be freed. kswapd runs periodically, scanning the lists of inactive and active pages for memory to free. It is woken up when free memory crosses a low threshold and goes back to sleep when it crosses a high threshold. Swapping usually causes applications to run much more slowly.

Filesystem


Applications usually interact directly with the file system, and file systems can use caching, read-ahead, buffering, and asynchronous I/O to avoid exposing disk I/O latency to the application. Logical I/O describes requests to the file system. If these requests must be served from the storage devices, they become physical I/O. Not all of it will; many logical read requests may be returned from the file system cache and never become physical I/O. File systems are accessed via a virtual file system (VFS), which provides operations for reading, writing, opening, closing, etc., and these are mapped by file systems to their internal functions. Linux uses multiple caches to improve the performance of storage I/O via the file system: the page cache, which contains virtual memory pages and improves the performance of file and directory I/O; the inode cache, which holds the data structures file systems use to describe their stored objects; and the directory cache, which caches mappings from directory entry names to VFS inodes, improving the performance of pathname lookups. The page cache grows to be the largest of these because it caches the contents of files, including "dirty" pages that have been modified but not yet written to disk.

Disk I/O


Linux exposes rotational magnetic media, flash-based storage, and network storage as storage devices, so disk I/O refers to I/O operations on these devices. Disk I/O is a common source of performance issues because I/O latency on storage devices is orders of magnitude slower than the nanosecond or microsecond speed of CPU and memory operations. Block I/O refers to device access in blocks; I/O is queued and scheduled in the block layer. Wait time is the time spent in the block-layer scheduler queues and the operating system's device dispatcher queues. Service time is the time from device issue to completion, which may include time spent waiting in an on-device queue. Request time is the overall time from when an I/O was inserted into the OS queues to its completion. The request time matters the most, as that is the time applications must wait if the I/O is synchronous.

Networking


Networking is a complex part of the Linux system. It involves many different layers and protocols, including the application, protocol libraries, syscalls, TCP or UDP, IP, and device drivers for the network interface. In general, the networking system can be broken down into four stages. First, NIC and device driver processing reads packets from the NIC and puts them into kernel buffers. Besides the NIC and device driver, this stage includes the DMA and dedicated memory regions in RAM, called rings, for storing receive and transmit packets, and the NAPI system for polling packets from these rings into the kernel buffers. It also incorporates early packet-processing hooks such as XDP and AF_XDP, and can use custom drivers that bypass the kernel (i.e., the following two stages), such as DPDK. Next is socket processing, which also includes queuing and the different queuing disciplines, as well as packet-processing hooks such as TC and Netfilter, which can alter the flow of the networking stack. After that is the protocol-processing layer, which applies the functions of the different IP and transport protocols; both run in SoftIRQ context. Last is the application stage, where the application receives and sends packets on the destination socket.

Flame Graphs to analyse code paths


A flame graph visualises a distributed request trace, representing each service call on the request's execution path as a timed, colour-coded, horizontal bar. Flame graphs for distributed traces include error and latency data to help developers identify and fix bottlenecks in their applications.

Syscalls across the system.

Analysing system calls (syscalls) across the system helps in categorising the workload of the system. This information is valuable in identifying the hardware resources that require optimisation, such as installing an accelerated network interface card (NIC) or a cryptographic accelerator.




Syscalls across the system.



Processes making syscalls

The information about the processes that make system calls provides valuable insights into the most active processes during the registration procedure. By observing the changes in latency and frequency of the system calls made by a process as the number of UEs increases, we can identify processes with a high probability of becoming bottlenecks. This information can be used to make several mitigation decisions, such as allocating more resources or dedicated resources to a given process or Network Function (NF), optimising resource usage by the NF or process, and examining the configuration of the process, among other things.






Process free5gc


Process oai


Process open5gs


epoll/poll/select


The system calls epoll/poll/select implement I/O multiplexing, which enables the simultaneous monitoring of multiple input and output sources in a single operation. These system calls are based on the Linux design principle that considers everything a file, and they operate by monitoring files to determine whether they are ready for the requested operation. The main advantage of multiplexing I/O operations is that they avoid blocking reads and writes, where a process would wait for data; instead, one waits on the multiplexing system call to learn which files are ready for reading or writing.



















































read/write


The read() system call is used to retrieve data from a file stored in the file system, while the write() system call is used to write data from a buffer to a file. Both system calls take a "count" argument, the number of bytes to read or write, and on success return the number of bytes actually read or written. By default, these system calls are blocking, but they can be changed to non-blocking using the fcntl system call. Blocking is a problem for programs that should operate concurrently, since blocked processes are suspended. There are two different, complementary ways to solve this problem: non-blocking mode and I/O multiplexing system calls, such as select and epoll. The architectural decision to use a combination of multiplexing I/O operations and non-blocking system calls offers advantages depending on the use case. Scenarios where this approach is beneficial include situations where small buffers would result in repeated system calls, when the system is dedicated to one function, or when multiple I/O system calls return an error.
























recv, recvfrom, recvmsg, recvmmsg


These are all system calls used to receive messages from a socket. They can be used to receive data on a socket whether or not it is connection-oriented. These system calls are blocking calls; if no messages are available at the socket, the receive calls wait for a message to arrive. If the socket is set to non-blocking, then the value -1 is returned and errno is set to EAGAIN or EWOULDBLOCK. Passing the flag MSG_DONTWAIT to the system call enables non-blocking operation; this provides behaviour similar to setting O_NONBLOCK with fcntl, except that MSG_DONTWAIT applies per operation. The recv() call is normally used only on a connected socket and is identical to recvfrom() with a NULL from parameter. recv(), recvfrom() and recvmsg() return the number of bytes received, or -1 if an error occurred. For connected sockets whose remote peer has shut down, 0 is returned when no more data is available. The recvmmsg() call returns the number of messages received, or -1 if an error occurred.
























send, sendto, sendmsg, sendmmsg


The send() call may only be used when the socket is in a connected state (so that the intended recipient is known). send() is similar to write(), differing only in the presence of flags. sendto() and sendmsg() work on both connected and unconnected sockets, and sendmsg() also allows sending ancillary data (also known as control information). The sendmmsg() system call is an extension of sendmsg() that allows the caller to transmit multiple messages on a socket using a single system call. The approaches to optimising the send-family system calls are similar to those discussed for the recv-family calls: I/O multiplexing, using the system calls in non-blocking mode, and sending multiple messages in a single system call where possible.
























nanosleep/clock_nanosleep


The nanosleep and clock_nanosleep system calls allow the calling thread to sleep for a specific interval with nanosecond precision. clock_nanosleep differs from nanosleep in two ways. Firstly, it allows the caller to select the clock against which the sleep interval is measured. Secondly, it enables the sleep interval to be specified as either an absolute or a relative value. Using an absolute timer helps prevent the timer drift that relative sleeps such as nanosleep can suffer from.
























futex


The futex() system call offers a mechanism to wait until a specific condition becomes true. It is typically used as a blocking construct in the context of shared-memory synchronisation. Additionally, futex() operations can be employed to wake up processes or threads that are waiting for a particular condition. The main design goal of futex is to handle mutexes in user space, avoiding the context switches involved in handling them in kernel space. In the futex design, the kernel is involved only when a thread needs to sleep or another thread needs to be woken. Essentially, the futex system call provides a kernel-side wait queue indexed by a user-space address, allowing threads to be added or removed from user space. A high frequency of futex system calls may indicate a high degree of concurrent access to shared resources or data structures by multiple threads or processes.















sched_yield


The sched_yield system call is used by a thread to relinquish the CPU and give other threads a chance to run. Strategic calls to sched_yield() can improve performance by giving other threads or processes an opportunity to run when (heavily) contended resources, such as mutexes, have been released by the caller. The authors of one study were able to improve the throughput of their system by calling sched_yield after a process handles each batch of packets, before calling poll. On the other hand, sched_yield can result in unnecessary context switches, which degrade system performance if used inappropriately. The latter is mainly true on generic Linux systems, where the scheduler is responsible for deciding which process runs: in many cases, when a process yields, the scheduler may still consider it the best candidate and put it straight back into execution, where it yields again in a loop. This behaviour is due to the algorithm and logic Linux's default scheduler uses to determine which process has the higher priority.